Abstract- Distribution in the form of agent technology can improve association rule based data mining
algorithms. Agents are well suited to continuous data mining because they reduce network load by
carrying the code to remote locations. A mobile agent is able to move from one place to another by
itself and to interact with the other mining agents for the purpose of data mining through component
based communication. This work therefore uses communication agents to improve the quality of the
Apriori algorithm. It proposes an architecture for improving Apriori mining in a distributed
environment; this architecture can facilitate instantaneous real-time messaging. The study also covers
the issue of constantly executing queries on continuous data in heterogeneous platforms or environments.

I. Introduction

The Apriori algorithm is used to find Frequent Item Sets (FISs) using candidate generation. Apriori is an
influential algorithm for mining FISs for Boolean association rules. The name of the algorithm is based on
the fact that the algorithm uses prior knowledge of FIS properties. Apriori utilizes an iterative approach
known as a level-wise search, where k-item sets are used to explore (k+1)-item sets. First, the set of
frequent 1-item sets is found. This set is denoted L1. L1 is used to find L2, the set of frequent 2-item
sets, which is used to find L3, and so on, until no more frequent k-item sets can be found. Finding each
Lk requires one full scan of the local relational databases. The growing importance of bigger data files
distributed over wide area networks (WAN), together with the limitations of bandwidth and software
tools, forced the development of distributed data mining (DDM). DDM is expected to analyze data
partially at individual sites and then send the outcome as a partial result to other sites, where it is
sometimes required to be unified to achieve a global result. There are mainly two architectures that
support distributed processing: client server and communicative agents. In this project we explore the
capabilities of mobile agents in DDM.

Consider the typical situations where a central data server has to collect data from several computers:
a WWW search engine collects data from web servers all over the world, a central information server in
a company collects data from different departments, and data mining runs on a large distributed
database [1]. Conventional Data Collection (CDC), Decentralized Data Collection (DDC), and Data
Collection with Communication Agents (DCCM) are three preferred solutions. CDC collects data from
several computers, with the disadvantage that the central server needs all the data from the other
computers before it can do any processing. In DDC, all computers run a kind of distributed search
engine, for example Harvest. The local search engines process data locally and transfer the results to
the central server. The disadvantage is that the local search engines need a lot of maintenance: when a
new version of the search engine comes up, it must be installed on every local server. In DCCM,
communication agents (CA) travel around the distributed environment for mining. At each terminal the
agent processes the data and sends the results back to the central server, which keeps network traffic
low because the agents process data locally [6]. The objective of the paper is to identify a better
algorithm and a suitable mining architecture for distributed environments, and to develop an
architecture to support Distributed Data Mining (DDM) that can be used to extract hidden predictive
information from large databases. It therefore has great potential to help companies focus on the most
important information in their data repositories.
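
The DCCM data flow described above can be sketched in Python as follows. This is only an illustration under assumed names: `SITES`, `local_item_counts` and `communication_agent` are hypothetical and belong neither to the paper nor to any agent toolkit, and the transactions are made-up sample data. The point is that the agent counts items where the data lives and carries only the counts between sites.

```python
from collections import Counter

# Hypothetical per-site transaction stores (illustrative data only).
SITES = {
    "site_a": [{"onion", "potato", "burger"}, {"onion", "potato"}],
    "site_b": [{"potato", "burger"}, {"onion", "potato", "burger"}],
}

def local_item_counts(transactions):
    """Work done at one site: count item occurrences in the local data."""
    counts = Counter()
    for transaction in transactions:
        counts.update(transaction)
    return counts

def communication_agent(sites):
    """Visit each site in turn and carry only the running counts along."""
    carried = Counter()  # the agent's state that travels between sites
    for name in sorted(sites):
        carried += local_item_counts(sites[name])
    return carried

global_counts = communication_agent(SITES)
```

Only `carried` crosses the network in this sketch, never the raw transactions, which is the property the text attributes to DCCM.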

The existing data mining algorithms for distributed data are communication intensive [2]. Many data
mining algorithms have been proposed for data held in a repository at a local host, and some of them
have been implemented at multiple locations with a little improvement in efficiency as a part of
quality. However, the complexity of these algorithms in a distributed environment has not been
addressed, even though data on the web or a network is distributed by its very nature. As a
consequence, both new architectures and new algorithms are needed, and they need to be merged together.

A mobile agent carries state attributes and code, which define the agent's behavior, such as when and
where to move and what to process there. A special type of agent is the 'communication agent', which
couriers messages back and forth between clients residing on various network nodes in a multi-server,
multi-client computation environment. Distributed data mining is thus performed efficiently with the
aggregation of agents, where communication between these agents is achieved with the following
methodology. The communication agents are controlled by using the Java classes of the Aglet tool kit.
Some important calls are briefed here. The FutureReply class evaluates whether a reply will be given to
a message; an aglet can perform another task in the meantime by calling proxy.sendAsyncMessage. Any
aglet that wants to communicate with other aglets has to first obtain the proxy object and then use
calls such as getAgletInfo. Communication is achieved by exchanging objects of the Message class. The
AgletProxy class is responsible for sending and receiving, through its sendMessage method.
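
The two message patterns mentioned above, a blocking send and a send whose reply is collected later, can be illustrated with a small Python analogy. This is not the Aglets API, which is a Java library: `MiningAgent`, `Proxy`, `send_message` and `send_future_message` are hypothetical names chosen for this sketch, and the future object comes from Python's standard `concurrent.futures` module.

```python
from concurrent.futures import ThreadPoolExecutor

class MiningAgent:
    """Hypothetical stand-in for an agent that answers messages."""

    def __init__(self, transactions):
        self.transactions = transactions

    def handle_message(self, kind, argument):
        # Reply with the number of local transactions containing the item set.
        if kind == "count":
            return sum(1 for t in self.transactions if argument <= t)
        raise ValueError(f"unknown message kind: {kind}")

class Proxy:
    """Hypothetical proxy: all communication with the agent goes through it."""

    def __init__(self, agent):
        self._agent = agent
        self._pool = ThreadPoolExecutor(max_workers=1)

    def send_message(self, kind, argument):
        # Synchronous: the caller blocks until the reply is available.
        return self._agent.handle_message(kind, argument)

    def send_future_message(self, kind, argument):
        # Asynchronous: returns a future; the caller can do other work first.
        return self._pool.submit(self._agent.handle_message, kind, argument)

    def close(self):
        self._pool.shutdown()

proxy = Proxy(MiningAgent([{"a", "b"}, {"a"}, {"b", "c"}]))
sync_reply = proxy.send_message("count", {"a"})
future_reply = proxy.send_future_message("count", {"b"})
other_work = sum(range(10))          # work done while the reply is pending
async_reply = future_reply.result()  # wait for the reply only when needed
proxy.close()
```

The caller never touches the agent object directly, which mirrors the role the text gives to the proxy.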




Association Rule Mining: In data mining, association rule learning is a popular and well researched
method for discovering interesting relations between variables in large databases [6, 7 and 8]. It
analyzes and presents strong rules discovered in databases using different measures of interestingness.
Association rules were introduced, based on the concept of strong rules, for discovering regularities
between products in large scale transaction data recorded by point-of-sale (POS) systems in supermarkets.

For example, a rule found in the sales data of a supermarket would indicate that if a customer buys
onions and potatoes together, he or she is likely to also buy a burger. Such information can be used as
the basis for decisions about marketing activities such as promotional pricing or product placements. In
addition to the above example from market basket analysis, association rules are employed today in many
application areas including Web usage mining, intrusion detection and bioinformatics. Three parallel
algorithms for mining association rules, an important data mining problem, are formulated in [3].
These algorithms have been designed to investigate and understand the performance implications of a
spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use
of problem-specific information in parallel data mining. Fast Distributed Mining of association rules
generates a small number of candidate sets and substantially reduces the number of messages to be passed
when mining association rules. Algorithms for mining association rules from relational data have been
well developed. Several query languages have been proposed to assist association rule mining. The topic
of mining XML data has received little attention, as the data mining community has focused on the
development of techniques for extracting common structure from heterogeneous XML data. For instance,
[14] has proposed an algorithm to construct a frequent tree by finding common subtrees embedded in the
heterogeneous XML data. On the other hand, some researchers focus on developing a standard model to
represent the knowledge extracted from the data using XML. JAM has been developed to gather
information from sparse data sources and induce a global classification model. The PADMA system is a
document analysis tool working in a distributed environment, based on cooperative agents. It works
without any relational database underneath. Instead, there are PADMA agents that perform several
relational operations with the information extracted from the documents.

An association rule is a rule which implies certain association relationships among a set of objects
(such as "occur together" or "one implies the other") in a database. Given a set of transactions, where
each transaction is a set of literals (called items), an association rule is an expression of the form
X => Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the
database which contain X tend to contain Y. Association rule mining (ARM) is one of the data mining
techniques used to extract hidden knowledge from datasets that can be used by an organization's decision
makers to improve overall profit.
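
As a small worked illustration of a rule X => Y, the Python sketch below computes the two standard measures for the onions-and-potatoes example: support (the fraction of transactions containing all items) and confidence (how often Y appears when X does). The function names and the four baskets are assumptions made for this example; they are not data from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Estimate of how often Y occurs given X: support(X u Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

# Illustrative baskets only; not data from the paper.
baskets = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger"},
    {"onion", "potato"},
    {"potato", "milk"},
]
x, y = {"onion", "potato"}, {"burger"}
rule_support = support(x | y, baskets)
rule_confidence = confidence(x, y, baskets)
```

Here two of the four baskets contain all three items, and two of the three baskets containing onions and potatoes also contain a burger.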

III. Apriori Analyses in Distributed Data Mining

However, the data parallelism algorithms need more memory at each remote site for storing the candidates
for each scan, and performance degrades if that memory is not available. The task parallelism
algorithms can avoid this type of degradation. Task distribution may work where data distribution may
not work [4]. The proposed approach, which uses mobile agents for task distribution, is therefore
expected to give efficient results. Earlier investigations of the suitability of the Apriori algorithm
for parallelization proposed four parallel algorithms based on Apriori to speed up mining of frequent
item sets.

The fourth type, the Candidate Distribution Mining (CDM) algorithm, performs the task of generating
longer patterns. It is also a load balancing algorithm that reduces synchronization between the
processors and segments the database based upon different transaction patterns. These parallel
algorithms were tested against each other, and CD had the best performance against the Apriori
algorithm: its overhead is less than 7.5% when compared with Apriori. Subsequent work tried to make
these algorithms scalable with the Hybrid Distribution (HD) algorithm. CDM addresses the issues of
communication overhead and redundant computation by using aggregate memory to partition candidates and
move data efficiently, and it dynamically partitions the candidate set to maintain good load balance.
Experimental results show that the response time of CDM is 4.4 times less than that of DDM on a
32-processor system, and HD is 9.5% better than CDM on 128 processors. The following graph shows a
comparison of the running time of the above four algorithms; it is therefore considered stable in DDM.
The proposed algorithm is given in Figure 1.

Apriori Algorithm

An association rule mining algorithm, Apriori, has been developed for rule mining in large transaction
databases by IBM's Quest project team. An itemset is a non-empty set of items.

They have decomposed the problem of mining association rules into two parts:

• Find all combinations of items that have transaction support above minimum support. Call those
combinations frequent itemsets.

• Use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and
AB are frequent itemsets, then we can determine if the rule AB => CD holds by computing the ratio r =
support (ABCD) / support (AB). The rule holds only if r >= minimum confidence. Note that the rule
will have minimum support because ABCD is frequent. The algorithm is highly scalable. The
Apriori algorithm used in Quest for finding all frequent itemsets is given below.
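
The second part, rule generation from the ratio r, can be sketched in Python as follows. This is a minimal illustration and not the Quest implementation: `generate_rules` is a hypothetical helper, the support table is assumed to already hold the count of every frequent itemset, and the counts shown are made up for the example.

```python
from itertools import combinations

def generate_rules(supports, min_confidence):
    """Return (antecedent, consequent, r) for rules that meet min_confidence.

    supports maps each frequent itemset (a frozenset) to its support count.
    """
    rules = []
    for itemset, itemset_support in supports.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so its
                # support is expected to be present in the table.
                r = itemset_support / supports[antecedent]
                if r >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, r))
    return rules

# Illustrative support counts only; not results from the paper.
supports = {
    frozenset("A"): 4, frozenset("B"): 3, frozenset("AB"): 3,
}
rules = generate_rules(supports, min_confidence=0.8)
```

With these counts, B => A has r = 3/3 and is kept, while A => B has r = 3/4 and is rejected at the 0.8 threshold.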



ALGORITHM Apriori_gen ():-

For all agents if (agents=true) {
    l1 = item set A; l2 = item set B;
    Procedure combine () {
        For all up to k-1
            Select l1.item1, ..., l1.itemk-1, l2.itemk-1 from Lk-1 l1, Lk-1 l2
            where l1.item1 = l2.item1 and ... and l1.itemk-2 = l2.itemk-2 and l1.itemk-1 < l2.itemk-1 ;}}

If (agent=false)

    // eliminate item sets such that some (k-1)-subset is not in Lk-1
    for all c in Ck
    Procedure eliminate () {

        For all (k-1)-subsets s of c if (s is not in Lk-1) {eliminate c from Ck; break; } }



Figure 1: Algorithm for Apriori_Agent 
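
The join and elimination steps of Figure 1 can be written as a short Python function. This is a sketch of the classical candidate generation only; it leaves out the agent conditions of the figure, and the function name `apriori_gen` and the representation of itemsets as sorted tuples are choices made for this example.

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Build candidate k-itemsets from frequent (k-1)-itemsets.

    frequent_prev is a set of sorted tuples, all of the same length k-1.
    """
    candidates = set()
    for l1 in frequent_prev:
        for l2 in frequent_prev:
            # Join step: same first k-2 items, last item of l1 < last of l2.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                candidates.add(l1 + (l2[-1],))
    # Prune step: drop candidates that have an infrequent (k-1)-subset.
    return {
        c for c in candidates
        if all(s in frequent_prev for s in combinations(c, len(c) - 1))
    }

# Illustrative frequent 3-itemsets.
l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
c4 = apriori_gen(l3)
```

In this example the join produces (1, 2, 3, 4) and (1, 3, 4, 5); the second is eliminated because its subset (1, 4, 5) is not frequent.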



It makes numerous passes over the database. In the first pass, the algorithm simply counts item
occurrences to determine the frequent 1-itemsets (itemsets with 1 item). A subsequent pass, say pass k,
consists of two phases. First, the frequent itemsets Lk-1 (the set of all frequent (k-1)-itemsets) found in
the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen () function. This
function first joins Lk-1 with Lk-1, the join condition being that the lexicographically ordered first k-2
items are the same. Next, it deletes all those itemsets from the join result that have some (k-1)-subset
that is not in Lk-1, yielding Ck. The algorithm now scans the database. For each transaction, it determines
which of the candidates in Ck are contained in the transaction, using a hash-tree data structure, and
increments the count of those candidates. At the end of the pass, Ck is examined to determine which of the
candidates are frequent, yielding Lk. The algorithm terminates when Lk becomes empty.
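
The pass structure described above can be sketched as a level-wise loop in Python. Two simplifications are assumed and should be read as such: candidates are counted with a plain subset test instead of a hash tree, and candidate generation is a compact set-based stand-in for the join and prune steps. The function name, the `min_count` parameter and the five sample transactions are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for all itemsets with count >= min_count."""
    # Pass 1: count single items to obtain the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: unions of size k whose (k-1)-subsets are
        # all frequent (a compact stand-in for the join and prune steps).
        prev = set(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Pass k: scan the data and count candidates with a subset test.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# Illustrative transactions only.
data = [frozenset(t) for t in (
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"},
)]
frequent_itemsets = apriori(data, min_count=3)
```

The loop stops at k = 3 here because the only candidate, ABC, occurs in two transactions and so the set of frequent 3-itemsets is empty.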



IV. Architecture for Web Services in Distributed Data Mining

The proposed architecture uses XML standards and web services, which are important [5]. They keep data
mining algorithms as web services, which are invoked and utilized by different knowledge discovery
applications located in distributed locations. The main components and the functionalities of these
components are described in sequential order.



Describing the components from the top left: the webbot is a very fast and reliable web walker with
support for regular expressions, SQL logging files, continuous data, and web log files collected in
parallel. Web server log files are downloaded, and the sessionizer generates a LOGML file. The
Integration Engine (IE), which is part of the Data Warehouse (DW) functionality, is suited for
preprocessing data at remote sites; after integration the data is loaded into the database, and
patterns are later generated in the form of graphs. User sessions are extracted from the web logs for
studying and analyzing sequence similarity, and are mined in a distributed manner using DDM. The logic
is as follows: frequent contiguous sequences with a given minimum support are found and imported into a
database, and the minimal frequent sequences are suppressed. Different queries are then run by the
algorithms Apriori and AprioriTid, combined as Apriori_agent (), against this data according to some
criteria of minsupport of each pattern. The fragmented results obtained by the different communication
agents are merged to obtain the target output.

Predictive Model Markup Language (PMML) is an XML-based language which follows a very intuitive
structure to describe data pre- and post-processing as well as to pass models as input to other
algorithms. PMML is used to transform raw data into meaningful features. A description of web
services would not be complete without mentioning the SOAP protocol [10, 11 and 12]. SOAP is not really
a simple protocol, and "object" has nothing to do with the protocol. It is not important to understand
SOAP, as it is transparent to you unless you deal with related low-level programming. The web service
architecture is shown in Figure 2.
Figure 2: Web Service Architecture and related standards
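
The sessionizer step mentioned in this section, which turns raw web log entries into user sessions, can be illustrated with the Python sketch below. It rests on assumptions that are not in the paper: log entries are taken to be (host, timestamp in seconds, url) tuples, a session is closed after 30 minutes of inactivity, and the function name `sessionize` is hypothetical. Writing the sessions out as LOGML is not shown.

```python
def sessionize(entries, timeout_seconds=1800):
    """Group (host, timestamp, url) entries into per-host sessions.

    A new session starts when a host has been idle longer than the timeout.
    The 30 minute default is an assumed value, not one taken from the paper.
    """
    sessions = []
    open_sessions = {}  # host -> (last timestamp, list of urls)
    for host, timestamp, url in sorted(entries, key=lambda e: e[1]):
        last = open_sessions.get(host)
        if last is None or timestamp - last[0] > timeout_seconds:
            urls = []
            sessions.append((host, urls))
        else:
            urls = last[1]
        urls.append(url)
        open_sessions[host] = (timestamp, urls)
    return sessions

# Illustrative log entries only.
log = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.2", 10, "/index.html"),
    ("10.0.0.1", 60, "/products.html"),
    ("10.0.0.1", 4000, "/index.html"),
]
sessions = sessionize(log)
```

Each resulting session is an ordered list of pages, which is the kind of sequence the frequent contiguous sequence mining described above would take as input.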



V. Implementation with Results 



The Apriori_Agent () algorithm is implemented with the Aglet class, the Aglet Message class and the
AgletProxy class, together with the ..\cnf\aglets.props file and the build XML files; the build is
started by running the ANT command. The Tahiti server starts automatically after running the
C:\aglets\bin\agletsd -f .\aglets.props command. The mining aglet is then dispatched to another host,
the following graphical user interface is created, and our data mining agent is ready to do Apriori
mining on the web server log files, which are downloaded, processed through a sessionizer and saved as a
LOGML file. Another agent located in the distributed environment may be asked to supply support values,
and the task of Apriori mining is completed at the remote host as shown in Figure 3. In a similar way,
any remote host can supply data such as log files, weblogs, LOGMLs or data.txt as input to our
Distributed Data Mining (DDM) with the help of aglets, which are event based mechanisms.




Figure 3: Clicking on the run button to get the output

Figure 4: Aglet dispatch to another host



The architecture sends the agent to another server. After dispatch, the agent no longer exists on the
current server. The protocol for the destination URL is the Agent Transfer Protocol (ATP). A dispatched
agent can also be brought back from a remote server: first specify the target server and you will get a
list of agents on the target server. Then specify one of the aglets from the server. The architecture of
distributed data mining will act on Data Warehouses located in remote locations. The architecture works
on DM with different heterogeneous data. The aglet dispatch to another host is shown in Figure 4.

VI. Conclusion

This paper proposes solutions to the issue of knowledge discovery in distributed data mining with less
complexity, and an algorithm using mobile agents with less exchange of data. Mobile agents are used for
candidate distribution; the above experiments show that mobile agents can be used efficiently for
candidate distribution in distributed data mining. The security of communication agents and the
reduction of processing and storage issues remain open, so more research has to be done in these areas,
or the use of the available mechanisms, such as some tools, has to be simplified. Investigation remains
to be carried out in the area of security design; the authors are working in these areas.